TB case reports (Solutions to Exercises)¶
Data sets from http://www.who.int/tb/country/data/download/en/
In [36]:
library(tidyverse)
library(stringr)
library(forcats)
Exercise (Messy to tidy)¶
We start with a fairly messy version of the data set in tidyr::who
.
In [37]:
who <- tidyr::who
In [38]:
head(who)
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | ⋯ | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 1980 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1981 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1982 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1983 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1984 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Afghanistan | AF | AFG | 1985 | NA | NA | NA | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Ex 1*.
- Find the definition of the first 6 column names by doing a
semi_join
withdict
.
In [40]:
labels <- data.frame(name = colnames(who)[1:6])
labels
name |
---|
country |
iso2 |
iso3 |
year |
new_sp_m014 |
new_sp_m1524 |
In [42]:
dict_url <- "https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
if (!file.exists("dict.csv")) download.file(dict_url, "dict.csv")
dict <- read_csv('dict.csv')
Parsed with column specification:
cols(
variable_name = col_character(),
dataset = col_character(),
code_list = col_character(),
definition = col_character()
)
In [43]:
semi_join(dict, labels, by=c("variable_name" = "name"))
Warning message:
“Column `variable_name`/`name` joining character vector and factor, coercing into character vector”
variable_name | dataset | code_list | definition |
---|---|---|---|
country | Country identification | NA | Country or territory name |
iso2 | Country identification | NA | ISO 2-character country/territory code |
iso3 | Country identification | NA | ISO 3-character country/territory code |
new_sp_m014 | Notification | NA | New pulmonary smear positive cases: males aged 0-14 years (not used after 2012) |
new_sp_m1524 | Notification | NA | New pulmonary smear positive cases: males aged 15-24 years (not used after 2012) |
Ex 2.
- Create a new column called
notification
and gather all columns beginning wiithnew
there, such that their values are in a new column calledcases
. Remove NA values when using gather.
In [44]:
who1 <- who %>% gather(starts_with('new'),
key = 'notification',
value = 'cases',
na.rm = TRUE)
In [45]:
head(who1)
country | iso2 | iso3 | year | notification | cases |
---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0 |
Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30 |
Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8 |
Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52 |
Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129 |
Afghanistan | AF | AFG | 2002 | new_sp_m014 | 90 |
Ex 3.
- Replace all occurrences of the string
newrel
withnew_rel
In [46]:
who2 <- who1 %>% mutate(notification = str_replace(notification, "newrel", "new_rel"))
In [47]:
head(who2)
country | iso2 | iso3 | year | notification | cases |
---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0 |
Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30 |
Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8 |
Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52 |
Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129 |
Afghanistan | AF | AFG | 2002 | new_sp_m014 | 90 |
Ex 4.
- Split the
notification
column intonew
,type
,sex
,age
. (Note: There are no sex unknown data here)
In [48]:
who3 <- who2 %>% separate(notification, into = c("new", "type", "sexage"))
In [49]:
head(who3)
country | iso2 | iso3 | year | new | type | sexage | cases |
---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new | sp | m014 | 0 |
Afghanistan | AF | AFG | 1998 | new | sp | m014 | 30 |
Afghanistan | AF | AFG | 1999 | new | sp | m014 | 8 |
Afghanistan | AF | AFG | 2000 | new | sp | m014 | 52 |
Afghanistan | AF | AFG | 2001 | new | sp | m014 | 129 |
Afghanistan | AF | AFG | 2002 | new | sp | m014 | 90 |
In [50]:
who4 <- who3 %>% separate(sexage, into = c("sex", "age"), sep=1)
In [51]:
head(who4)
country | iso2 | iso3 | year | new | type | sex | age | cases |
---|---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new | sp | m | 014 | 0 |
Afghanistan | AF | AFG | 1998 | new | sp | m | 014 | 30 |
Afghanistan | AF | AFG | 1999 | new | sp | m | 014 | 8 |
Afghanistan | AF | AFG | 2000 | new | sp | m | 014 | 52 |
Afghanistan | AF | AFG | 2001 | new | sp | m | 014 | 129 |
Afghanistan | AF | AFG | 2002 | new | sp | m | 014 | 90 |
Ex 5.
- Drop the
iso2
,iso3
,new
columns.
In [52]:
who5 <- who4 %>% select(-iso2, -iso3, -new)
In [53]:
head(who5)
country | year | type | sex | age | cases |
---|---|---|---|---|---|
Afghanistan | 1997 | sp | m | 014 | 0 |
Afghanistan | 1998 | sp | m | 014 | 30 |
Afghanistan | 1999 | sp | m | 014 | 8 |
Afghanistan | 2000 | sp | m | 014 | 52 |
Afghanistan | 2001 | sp | m | 014 | 129 |
Afghanistan | 2002 | sp | m | 014 | 90 |
Exercise (Exploratory data analysis)¶
Ex 1.
- For each country compute the total number of cases of TB and save as
ex1
In [54]:
ex1 <- who5 %>% group_by(country) %>% summarize(count = sum(cases))
In [55]:
head(ex1)
country | count |
---|---|
Afghanistan | 140225 |
Albania | 5335 |
Algeria | 128119 |
American Samoa | 41 |
Andorra | 103 |
Angola | 308365 |
Ex 2.
- For each year, find the country that had the smear positive pulmonary TB. Display in chronological order.
In [56]:
who5 %>% filter(type == 'sp') %>%
group_by(year) %>%
filter(cases == max(cases)) %>%
select(year, country, cases) %>%
arrange(year)
year | country | cases |
---|---|---|
1980 | Canada | 186 |
1981 | Canada | 141 |
1982 | Canada | 150 |
1983 | Canada | 123 |
1984 | Canada | 169 |
1985 | Canada | 168 |
1986 | Canada | 147 |
1987 | Canada | 129 |
1988 | Canada | 131 |
1989 | Canada | 122 |
1990 | Canada | 100 |
1991 | Canada | 110 |
1992 | Canada | 79 |
1993 | Canada | 74 |
1994 | Canada | 87 |
1995 | China | 18306 |
1996 | China | 24057 |
1997 | China | 28247 |
1998 | China | 30093 |
1999 | China | 29328 |
2000 | India | 31090 |
2001 | India | 30007 |
2002 | India | 55829 |
2003 | India | 63587 |
2004 | India | 74450 |
2005 | India | 76870 |
2006 | India | 82939 |
2007 | India | 88045 |
2008 | India | 90498 |
2009 | India | 90830 |
2010 | India | 90440 |
2011 | India | 89706 |
2012 | India | 88111 |
3. Express the data for China as a data frame with just 3 columns -
year
, f
and m
where the values for f
and m
are the
sum of female and male cases in that year. Show just the 5 years with
highest total count of cases.
In [57]:
who5 %>%
filter(country == 'China') %>%
group_by(year, sex) %>%
summarize(count = sum(cases)) %>%
spread(key = sex, value = count) %>%
arrange(desc(f + m)) %>%
head(5)
year | f | m |
---|---|---|
2009 | 270530 | 613947 |
2010 | 268998 | 600094 |
2011 | 263933 | 601126 |
2012 | 265110 | 593751 |
2013 | 261049 | 586127 |
Ex 4.
- Make a scatter plot of the number of reports (each row is a report) of TB in each country with the regions on the y axis. Restrict to countries that begin with the letter A. Order the countries by number of reports and remove the y label.
In [58]:
options(repr.plot.width=4, repr.plot.height=3)
In [59]:
foo <- who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country) %>%
summarise(n = n())
foo %>%
ggplot(aes(n, reorder(country, n))) +
geom_point() +
ylab(NULL)
Data type cannot be displayed:

Ex 5.
- Display
ex1
as a bar chart. Because of the large number of countries, place the countries on the y-axis, and display countries from fewest to most cases. Scale the counts using a log10 scale, adding 1 to each count so we don’t have to deal with negative infinity.
In [60]:
options(repr.plot.width=6, repr.plot.height=30)
In [61]:
ggplot(ex1) +
geom_bar(aes(x=reorder(country, -count), y=1+count), stat='identity') +
xlab(NULL) +
ylab("Log10(TB cases)") +
scale_y_log10() +
coord_flip()
Data type cannot be displayed:

Ex 6.
- Make a bar chart of the number of reports (each row is a report) of TB in each country that begin with the letter A. Order the countries by number of reports increasing from left to right. Rotate the country labels so they are readable.
In [62]:
options(repr.plot.width=4, repr.plot.height=3)
In [63]:
who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)
Data type cannot be displayed:

Ex 7.
- Can you display as proportion instead of count?
In [64]:
who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev,
y = ..prop..,
group = 1)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)
Data type cannot be displayed:

Ex 8.
- Add a label showing the first 3 letters of the country label on top of each bar, and remove the x tick and axis labels. Make the bars 50% transparent and fill by prop using a color gradient from yellow to red.
In [65]:
ex8 <- who5 %>%
filter(str_detect(country, '^A')) %>%
mutate(country = country %>% str_sub(1, 3) %>% str_to_upper) %>%
group_by(country) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
In [66]:
ex8
country | n | prop |
---|---|---|
AFG | 244 | 0.056286044 |
ALB | 448 | 0.103344867 |
ALG | 224 | 0.051672434 |
AME | 172 | 0.039677047 |
AND | 387 | 0.089273356 |
ANG | 425 | 0.098039216 |
ANT | 346 | 0.079815456 |
ARG | 448 | 0.103344867 |
ARM | 461 | 0.106343714 |
ARU | 37 | 0.008535179 |
AUS | 826 | 0.190542099 |
AZE | 317 | 0.073125721 |
In [67]:
options(repr.plot.width=7, repr.plot.height=3)
In [69]:
ex8 %>%
ggplot(aes(x = reorder(country, prop),
y = prop,
fill = prop)) +
geom_bar(stat = 'identity', alpha = 0.5) +
geom_text(label = reorder(ex8$country, ex8$prop),
nudge_y = .02) +
xlab(NULL) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
scale_fill_gradient(low = "yellow", high="red") +
guides(fill = FALSE)
Data type cannot be displayed:

Ex 9.
- Make a bar chart of counts for male and female side by side with countries starting with A on the y axis
- Set the fill of the bar chart to pink for f and blue for m
- Add a meaningful title
In [70]:
options(repr.plot.width=6, repr.plot.height=6)
In [71]:
who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country, sex) %>%
mutate(count = sum(cases)) %>%
ggplot +
geom_bar(aes(x=reorder(country, count),
y=count, fill=sex),
position='dodge',
stat='identity') +
scale_fill_manual(values = c("f" = "pink", "m" = "blue")) +
labs(title="TB cases by region and sex") +
coord_flip()
Data type cannot be displayed:

Ex 10.
- Make a grid of plots where each column is a different sex), each row is a different country starting with A), and each subplot is a plot of trend over time. Allow the y - scales to vary across plots.Make the fitted curves for each sex have a different color.
In [72]:
options(repr.plot.width=6, repr.plot.height=15)
In [35]:
g <- who5 %>% filter(str_detect(country, '^A')) %>%
ggplot(aes(x = year, y=cases, color=sex)) +
geom_smooth() +
facet_grid(country ~ sex, scales="free")
suppressWarnings(print(g))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

In [ ]: